Phonetic unit duration adjustment for text-to-speech system

ABSTRACT

Input text is converted to a sequence of representations of syllables or other phonetic units and stored portions of data are retrieved to generate waveforms corresponding to the syllables. In order to determine durations for the syllables, a constant duration is defined corresponding to a regular beat period and adjusted in accordance with the nature of the syllable and/or its context within the sequence.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is concerned with speech synthesis, and particularly, though not exclusively, with text-to-speech synthesisers which operate by concatenating segments of stored speech waveforms.

2. Related Art

Various prior art systems have been devised for converting text to synthesized speech. While these systems provide associated techniques to determine the timing and duration of synthesized phonetic units, it is believed that there is room for improvement in this regard.

BRIEF SUMMARY OF THE INVENTION

According to the present invention there is provided a speech synthesiser comprising:

means for supplying a sequence of representations of phonetic units;

means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units;

means for determining durations for the phonetic units; and

means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations;

wherein the determining means is operable to define a constant duration corresponding to a regular beat period and to adjust that duration in dependence on the nature of the phonetic unit and/or its context within the sequence.

Preferably the stored data are themselves digitised speech waveforms (though this is not essential and the invention may also be applied to other types of synthesiser such as formant synthesisers). Thus in a preferred arrangement the synthesiser includes a store containing items of data representing waveforms corresponding to phonetic sub-units, the retrieving means being operable to retrieve, for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and a further store containing for each sub-unit statistical duration data including a maximum value and a minimum value, wherein the determining means is operable to compute for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and to adjust the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values.

In the preferred embodiment the phonetic units are syllables and the sub-units are phonemes.

BRIEF DESCRIPTION OF THE DRAWING

One embodiment of the invention will now be described with reference to FIG. 1 which is a block diagram of a speech synthesiser.

DETAILED DESCRIPTION OF AN EXEMPLARY EMBODIMENT

The speech synthesiser of FIG. 1 has an input 1 for receiving input text in coded form, for example in ASCII code. A text normalisation unit 2 preprocesses the text to convert symbols and numbers into words; for example an input “£100” will be converted to “one hundred pounds”. The output from this passes to a pronunciation unit 3 which converts the text into a phonetic representation, by the use of a dictionary or a set of rules or, more preferably, both. This unit also produces, for each syllable, a parameter indicative of the lexical stress to be placed on that syllable.

A parser 4 analyses each sentence to determine its structure in terms of the parts of speech (adjectives, nouns, verbs etc.) and generates performance structures such as major and minor phrases (a major phrase is a word or group of words delimited by silence). A pitch assignment unit 5 computes a “salience” value for each syllable based on the outputs of the units 3 and 4. This value is indicative of the relative stress given to each syllable, as a function of the lexical stress, boundaries between major and minor phrases, parts of speech and other factors. Commonly this is used to control the fundamental pitch of the synthesised speech (though arrangements for this are not shown in the Figure).

The phonetic representation from the unit 3 also passes to a selection unit 6 which has access to a database 7 containing digitised segments of speech waveform each corresponding to a respective phoneme. Preferably (though this is not essential to the invention) the database may contain a number of examples of each phoneme, recorded (by a human speaker) in different contexts, the selection unit serving to select that example whose context most closely matches the context in which the phoneme to be generated actually appears in the input text (in terms of the match between the phonemes flanking the phoneme in question). Arrangements for this type of selection are described in our co-pending European patent application No. 93306219.2. The waveform segments will (as described further below) be concatenated to produce a continuous sequence of digital waveform samples corresponding to the text received at the input 1.

The units described above are conventional in operation. However the apparatus also includes a duration calculation unit 8. This serves to produce, for each phoneme, an output indicating its duration in milliseconds (or other convenient temporal measure). Its operation is based on the idea of a regular beat rate, that is, a rate of production of syllables which is constant, or at least constant over a portion of speech. This beat may be viewed as defining a period of time into which the syllable must be fitted if possible, though as will be seen, the actual duration will at times deviate from this period. The apparatus shown assumes a fixed underlying beat rate but the setting of this may be changed by the user. A typical rate might be 0.015 beats/ms (i.e. a beat period of 66.7 ms).

The duration unit 8 has access to a database 9 containing statistical information for each phoneme, as follows:

the minimum segmental duration P_(i,min) of that phoneme

the maximum segmental duration P_(i,max) of that phoneme

the mean or modal segmental duration P_(i,M) of that phoneme

it being understood that these values are stored for each phoneme p_(i) (i = 1, ..., n) of the set P of all legal phonemes. The modal duration is the most frequently occurring value in the distribution of phoneme lengths, this being preferred to the mean. These values may be determined from a database of annotated speech samples. Raw statistical values may be used or smoothed data such as gamma modelled durations may be used. For the best results this statistical information should be derived from speech of the same style as that to be synthesised; indeed, if the database 7 contains multiple examples of each phoneme p_(i) the statistical information may be generated from the contents of the database 7 itself. It should also be mentioned that these values are determined only once.
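As an illustration only, the raw (unsmoothed) statistics for one phoneme might be gathered as in the following sketch. The function name `duration_stats`, the sample durations and the tie-breaking rule for the mode are assumptions made for this example and are not taken from the description; durations are assumed to be annotated in whole milliseconds so that a most frequent value exists.

```python
from collections import Counter

def duration_stats(durations_ms):
    """Return (minimum, maximum, modal) duration for one phoneme.

    durations_ms: observed segmental durations, assumed to be quantised
    (for example to whole milliseconds) so that values can repeat.
    """
    if not durations_ms:
        raise ValueError("need at least one observed duration")
    counts = Counter(durations_ms)
    highest = max(counts.values())
    # Assumption: ties for the mode are broken in favour of the shortest value.
    modal = min(d for d, c in counts.items() if c == highest)
    return min(durations_ms), max(durations_ms), modal

# Hypothetical annotated durations (ms) for two phonemes.
observed = {"k": [40, 45, 45, 50, 60], "ae": [70, 80, 80, 80, 110]}
stats = {phoneme: duration_stats(d) for phoneme, d in observed.items()}
```

Because the values are determined only once, such a table could be computed offline and stored in the database 9.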

The duration unit 8 proceeds as follows for each syllable j. The notation assumes that each syllable contains L phonemes (where L obviously varies from syllable to syllable) and the l'th phoneme is identified by an index i(l); i.e. if phoneme p_(3) is found at position 2 in the syllable then i(2)=3:

(1) Determine the minimum and maximum possible duration of the syllable j, i.e.

${Syl}_{j,\min} = \sum_{l=1}^{L} p_{i(l),\min}$

${Syl}_{j,\max} = \sum_{l=1}^{L} p_{i(l),\max}$

The maximum and minimum values represent a first set of bounds on the syllable duration.
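Step (1) can be sketched as follows. The phoneme labels and the statistics in `PHONEME_STATS` are hypothetical values chosen for illustration, and `syllable_bounds` is a name introduced here rather than one used in the description.

```python
# Hypothetical per-phoneme statistics in milliseconds: (minimum, maximum, modal).
PHONEME_STATS = {
    "k": (40, 60, 45),
    "ae": (70, 110, 80),
    "t": (30, 70, 40),
}

def syllable_bounds(phonemes, stats=PHONEME_STATS):
    """Sum the per-phoneme minima and maxima over the phonemes of one syllable."""
    syl_min = sum(stats[p][0] for p in phonemes)
    syl_max = sum(stats[p][1] for p in phonemes)
    return syl_min, syl_max

# A three-phoneme syllable: minimum 40+70+30, maximum 60+110+70.
bounds = syllable_bounds(["k", "ae", "t"])
```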

(2) Associated with each syllable is a factor indicating the degree of salience, obtained from the unit 5; as explained above, it is determined from information indicating how prominent the syllable is within the word and how prominent the word is within the sentence. Thus this factor is used to determine how much a given syllable may be squeezed in time. It is assumed that the salience factor Sal_(j) (for the jth syllable) has a range from 0 to 100. A salience factor of 0 means that the syllable may be squeezed to its minimum duration Syl_(j,min), whilst a salience factor of 100 indicates that it can assume the maximum duration Syl_(j,max). Thus a modified minimum duration is computed as:

${Syl}'_{j,\min} = {Syl}_{j,\min} + ({Syl}_{j,\max} - {Syl}_{j,\min}) \cdot {Sal}_{j} / 100$
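A minimal sketch of step (2), assuming durations in milliseconds and a salience value already scaled to the range 0 to 100; the function name `modified_minimum` is introduced here for illustration.

```python
def modified_minimum(syl_min, syl_max, salience):
    """Raise the syllable's minimum duration in proportion to its salience.

    salience 0 leaves the minimum unchanged; salience 100 raises it to syl_max.
    """
    if not 0 <= salience <= 100:
        raise ValueError("salience is assumed to lie in the range 0 to 100")
    return syl_min + (syl_max - syl_min) * salience / 100

# With bounds of 140 ms and 240 ms, a salience of 25 gives 140 + 100 * 0.25.
example = modified_minimum(140, 240, 25)
```

The effect is that a more salient syllable can be squeezed less: its lower bound moves up towards the maximum while the upper bound stays where it is.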

(3) Calculate the desired duration Syl_(j,C) using the beat period T if this lies within the range defined by the modified minimum duration and the maximum duration, and using the modified minimum or the maximum otherwise. Viz.:

If T<Syl′_(j,min) then

Syl_(j,C)=Syl′_(j,min)

Otherwise

If T>Syl_(j,max) then

Syl_(j,C) =Syl_(j,max)

Otherwise

Syl_(j,C)=T
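The rule of step (3) amounts to clamping the beat period to the interval between the modified minimum and the maximum. A sketch, with the function name and the numerical values chosen purely for illustration:

```python
def desired_duration(beat_period, modified_min, syl_max):
    """Use the beat period T when it lies within [modified_min, syl_max];
    otherwise use whichever bound it violates."""
    if beat_period < modified_min:
        return modified_min
    if beat_period > syl_max:
        return syl_max
    return beat_period

# Hypothetical bounds of 165 ms and 240 ms.
inside = desired_duration(200, 165, 240)     # beat period fits, so T is used
too_short = desired_duration(66.7, 165, 240)  # T below range, modified minimum used
too_long = desired_duration(300, 165, 240)   # T above range, maximum used
```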

(4) Once the duration of the syllable has been determined the durations of the individual phonemes within the syllable must be determined. This is done by apportioning the available time Syl_(j,C) among the L phonemes according to the relative weights of their modal durations:

first, find the proportion r_(l) of the syllable to be occupied by the l'th phoneme:

$r_{l} = p_{i(l),M} / \sum_{k=1}^{L} p_{i(k),M}$

The computed duration of the l'th phoneme of the jth syllable is then obtained from:

P_(i(l),C) = r_(l) · Syl_(j,C)
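Step (4) might be sketched as below. The function name `apportion` and the modal durations used in the example are assumptions for illustration only.

```python
def apportion(syl_duration, modal_durations):
    """Split a syllable duration among its phonemes in proportion to
    their modal durations."""
    total = sum(modal_durations)
    if total <= 0:
        raise ValueError("modal durations must sum to a positive value")
    # r is the proportion of the syllable occupied by each phoneme.
    return [(m / total) * syl_duration for m in modal_durations]

# Hypothetical modal durations of 45, 80 and 40 ms sharing a 200 ms syllable.
phoneme_durations = apportion(200, [45, 80, 40])
```

Since the proportions sum to one, the phoneme durations add up to the syllable duration (to within floating-point rounding).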

Typically, a person does not speak at a constant rate. In particular, an utterance containing a large number of words is spoken more quickly than an utterance which contains fewer words.

For this reason, in a preferred embodiment of the present invention, a further modification is made to the phoneme duration P_(i(l),C) in dependence upon the length of the major phrase which contains the phoneme in question.

In calculating this modification, a percentage increase or decrease in the phoneme duration is calculated as a simple linear function of the number of syllables in the major phrase, with a cut-off at seven syllables. The greatest percentage increase in the phoneme duration is applied when there is only one syllable in a major phrase, the modification decreasing linearly as the number of syllables increases up to seven syllables. The modification made to the duration of phonemes contained within a major phrase having more than seven syllables is the same as that made to a phoneme contained within a major phrase having seven syllables. It might in some situations be found that a cut-off point at more or fewer than seven syllables is to be preferred.

In addition, it will be appreciated that non-linear functions might provide a better model of the relationship between the number of syllables within a major phrase and the duration of the syllables within it. Also, word groupings other than major phrases may be used.
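The linear modification with a cut-off can be sketched as follows. The description does not state the actual percentages, so the end-point values of +30% for a one-syllable phrase and -6% at the cut-off are hypothetical, as are the function names.

```python
def phrase_length_change_pct(n_syllables, pct_at_one=30.0,
                             pct_at_cutoff=-6.0, cutoff=7):
    """Percentage change in phoneme duration as a linear function of the
    number of syllables in the major phrase, held constant beyond the cut-off.

    pct_at_one and pct_at_cutoff are illustrative values, not taken from
    the description.
    """
    if n_syllables < 1:
        raise ValueError("a major phrase contains at least one syllable")
    n = min(n_syllables, cutoff)
    fraction = (n - 1) / (cutoff - 1)  # 0 at one syllable, 1 at the cut-off
    return pct_at_one + (pct_at_cutoff - pct_at_one) * fraction

def adjust_for_phrase(phoneme_duration, n_syllables):
    """Apply the percentage change to a computed phoneme duration."""
    return phoneme_duration * (1 + phrase_length_change_pct(n_syllables) / 100)
```

A non-linear model, or a different cut-off, could be substituted by replacing `phrase_length_change_pct` alone.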

Once the phoneme duration has been computed (and, in the case of the preferred embodiment, modified), a realisation unit 10 serves to receive, for each phoneme in turn, the corresponding waveform segment from the unit 6, and adjust the length of it to correspond to the computed (and possibly modified) duration using an overlap-add technique. This is a known technique for adjusting the length of segments of speech waveform whereby portions corresponding to the pitch period of the speech are separated using overlapping window functions synchronous (for voiced speech) with pitchmarks (stored in the database 7 along with the waveforms themselves) corresponding to the original speaker's glottal excitation. It is then a simple matter to reduce or increase the duration by omitting or, as the case may be, repeating portions prior to adding them back together. The concatenation of one phoneme with the next may also be performed by an overlap-add process; if desired the improved overlap-add process described in our co-pending European patent application No. 95302474.2 may be used for this purpose.
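The following sketch covers only one part of the overlap-add adjustment: deciding which windowed pitch-period portions to repeat or omit so that the segment contains a target number of portions. The windowing and the adding back together are not shown. The even-spacing rule and the name `select_periods` are assumptions made for illustration; the description does not say how the portions are chosen.

```python
def select_periods(n_source, n_target):
    """Return, for each output portion, the index of the source portion to use.

    Portions are repeated when n_target > n_source and omitted when
    n_target < n_source, spread evenly across the segment.
    """
    if n_source < 1 or n_target < 1:
        raise ValueError("both counts must be at least 1")
    return [(k * n_source) // n_target for k in range(n_target)]

lengthened = select_periods(4, 6)  # [0, 0, 1, 2, 2, 3]: portions 0 and 2 repeated
shortened = select_periods(4, 2)   # [0, 2]: portions 1 and 3 omitted
```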

As an alternative, the modification described in relation to the preferred embodiment of the present invention may be made to the modal duration of the phonemes without calculating the syllable duration.

What is claimed is:
1. A speech synthesis method comprising: supplying a sequence of representations of phonetic units; retrieving stored portions of data to generate waveforms corresponding to the phonetic units; determining durations for the phonetic units; and processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining step is operable to define a constant duration for said phonetic unit, said constant duration corresponding to a regular beat period and selectively in dependence on the intrinsic duration of the phonetic unit and/or its context within the sequence, to carry out a constant duration regulation calculation.
2. A speech synthesis method as in claim 1 further comprising: identifying major phrases in said sequence; wherein the determining step further adjusts said durations for the phonetic units in dependence upon the number of phonetic units falling within a major phrase.
3. A speech synthesis method as in claim 1 in which the phonetic units are syllables.
4. A speech synthesis method as in claim 1 including: storing items of data representing waveforms corresponding to phonetic sub-units, the retrieving step retrieving for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and further storing for each sub-unit statistical duration data including a maximum value and a minimum value; wherein the determining step computes for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and adjusts the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values.
5. A speech synthesis method as in claim 4 in which the sub-units are phonemes.
6. A speech synthesis method comprising: supplying a sequence of representations of phonetic units; retrieving stored portions of data to generate waveforms corresponding to the phonetic units; determining durations for the phonetic units; processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining step is operable to define a constant duration corresponding to a regular beat period and to adjust that duration in dependence on the intrinsic duration of the phonetic unit and/or its context within the sequence, storing items of data representing waveforms corresponding to phonetic sub-units, the retrieving step retrieving for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and further storing for each sub-unit statistical duration data including a maximum value and a minimum value; wherein the determining step computes for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and adjusts the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values; wherein said determining step adjusts the said constant duration value such that it does not fall below a modified minimum value which exceeds the sum of the minimum values to an extent determined by the context of the phonetic unit.
7. A speech synthesis method comprising: supplying a sequence of representations of phonetic units; retrieving stored portions of data to generate waveforms corresponding to the phonetic units; determining durations for the phonetic units; processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining step is operable to define a constant duration corresponding to a regular beat period and to adjust that duration in dependence on the intrinsic duration of the phonetic unit and/or its context within the sequence, storing items of data representing waveforms corresponding to phonetic sub-units, the retrieving step retrieving for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and further storing for each sub-unit statistical duration data including a maximum value and a minimum value; wherein the determining step computes for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and adjusts the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values; wherein the statistical duration data include for each sub-unit a central value, and each sub-unit of a phonetic unit is assigned a duration which is a fraction of the adjusted constant value for that phonetic unit in proportion to the ratio of the central value for that sub-unit to the sum of the central values for the constituent sub-units of that phonetic unit.
8. A speech synthesis method comprising: supplying a sequence of representations of phonetic units; retrieving stored portions of data to generate waveforms corresponding to the phonetic units; determining durations for the phonetic units; and processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining step is operable to: a) determine bounds for said duration, said bounds depending on the intrinsic duration of the phonetic unit and/or its context within the sequence; and b) assign a constant duration corresponding to a regular beat period to said phonetic unit provided said constant duration does not transgress said bounds.
9. A speech synthesis method as in claim 8 in which the phonetic units are syllables.
10. A speech synthesis method comprising: supplying a sequence of representations of phonetic units; retrieving stored portions of data to generate waveforms corresponding to the phonetic units; determining durations for the phonetic units; and processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining step is operable to: a) determine bounds for said duration, said bounds depending on the intrinsic duration of the phonetic unit and/or its context within the sequence; and b) assign a constant duration corresponding to a regular beat period to said phonetic unit provided said constant duration does not transgress said bounds, the retrieving step retrieving for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and the determining step computing for each phonetic unit the sum of minimum duration values and the sum of maximum duration values for the constituent sub-unit(s) thereof and correcting the said constant duration if the computed constant duration falls below the sum of the minimum values or exceeds the sum of the maximum values.
11. A speech synthesis method as in claim 10 in which the sub-units are phonemes.
12. A speech synthesis method as in claim 10 in which the determining step is operable to adjust the said constant duration value such that it does not fall below a modified minimum value which exceeds the sum of the minimum values to an extent determined by the context of the phonetic unit.
13. A speech synthesis method as in claim 10 in which: the statistical duration data include for each sub-unit a central value, and including assigning to each sub-unit of a phonetic unit a duration which is a fraction of the adjusted constant value for that phonetic unit in proportion to the ratio of the central value for that sub-unit to the sum of the central values for the constituent sub-units of that phonetic unit.
14. A speech synthesiser comprising: means for supplying a sequence of representations of phonetic units; means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units; means for determining durations for the phonetic units; and means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining means is operable to define a constant duration for said phonetic unit, said constant duration corresponding to a regular beat period and selectively in dependence on the intrinsic duration of the phonetic unit and/or its context within the sequence, to carry out a constant duration regulation calculation.
15. A speech synthesiser as in claim 14 further comprising: means for identifying major phrases in said sequence; wherein the determining means further adjust said durations for the phonetic units in dependence upon the number of phonetic units falling within a major phrase.
16. A speech synthesiser as in claim 14 in which the phonetic units are syllables.
17. A speech synthesiser as in claim 14 including: a store containing items of data representing waveforms corresponding to phonetic sub-units, the retrieving means being operable to retrieve, for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and a further store containing for each sub-unit statistical duration data including a maximum value and a minimum value, wherein the determining means is operable to compute for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and to adjust the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values.
18. A speech synthesiser as in claim 17 in which the sub-units are phonemes.
19. A speech synthesiser comprising: means for supplying a sequence of representations of phonetic units; means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units; means for determining durations for the phonetic units; means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining means is operable to define a constant duration corresponding to a regular beat period and to adjust that duration in dependence on the nature of the phonetic unit and/or its context within the sequence; a store containing items of data representing waveforms corresponding to phonetic sub-units, the retrieving means being operable to retrieve, for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and a further store containing for each sub-unit statistical duration data including a maximum value and a minimum value, wherein the determining means is operable to compute for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and to adjust the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values; and wherein the determining means is operable to adjust the said constant duration value such that it does not fall below a modified minimum value which exceeds the sum of the minimum values to an extent determined by the context of the phonetic unit.
20. A speech synthesiser comprising: means for supplying a sequence of representations of phonetic units; means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units; means for determining durations for the phonetic units; means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining means is operable to define a constant duration corresponding to a regular beat period and to adjust that duration in dependence on the nature of the phonetic unit and/or its context within the sequence; a store containing items of data representing waveforms corresponding to phonetic sub-units, the retrieving means being operable to retrieve, for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and a further store containing for each sub-unit statistical duration data including a maximum value and a minimum value, wherein the determining means is operable to compute for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and to adjust the said constant duration such that it neither falls below the sum of the minimum values nor exceeds the sum of the maximum values; and wherein the statistical duration data include for each sub-unit a central value, and means to assign to each sub-unit of a phonetic unit a duration which is a fraction of the adjusted constant value for that phonetic unit in proportion to the ratio of the central value for that sub-unit to the sum of the central values for the constituent sub-units of that phonetic unit.
21. A speech synthesizer comprising: means for supplying a sequence of representations of phonetic units; means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units; means for determining durations for the phonetic units; and means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining means is operable to: a) determine bounds for said duration, said bounds depending on the intrinsic duration of the phonetic unit and/or its context within the sequence; and b) assign a constant duration corresponding to a regular beat period to said phonetic unit provided said constant duration does not transgress said bounds.
22. A speech synthesizer as in claim 21 in which the phonetic units are syllables.
23. A speech synthesizer comprising: means for supplying a sequence of representations of phonetic units; means for retrieving stored portions of data to generate waveforms corresponding to the phonetic units; means for determining durations for the phonetic units; and means for processing the portions of data to adjust the time durations of the waveforms according to the determined durations; wherein the determining means is operable to: a) determine bounds for said duration, said bounds depending on the intrinsic duration of the phonetic unit and/or its context within the sequence; and b) assign a constant duration corresponding to a regular beat period to said phonetic unit provided said constant duration does not transgress said bounds, a store containing items of data representing waveforms corresponding to phonetic sub-units, the retrieving means being operable to retrieve, for each phonetic unit, one or more portions of data each corresponding to a sub-unit thereof, and a further store containing for each sub-unit statistical duration data including a maximum value and a minimum value, wherein the determining means is operable to compute for each phonetic unit the sum of the minimum duration values and the sum of the maximum duration values for the constituent sub-unit(s) thereof and to correct the said constant duration if the computed constant duration falls below the sum of minimum values or exceeds the sum of the maximum values.
24. A speech synthesizer as in claim 23 in which the sub-units are phonemes.
25. A speech synthesizer as in claim 23 in which the determining means is operable to adjust the said constant duration value such that it does not fall below a modified minimum value which exceeds the sum of the minimum values to an extent determined by the context of the phonetic unit.
26. A speech synthesizer as in claim 23 in which: the statistical duration data include for each sub-unit a central value, and including means to assign to each sub-unit of a phonetic unit a duration which is a fraction of the adjusted constant value for that phonetic unit in proportion to the ratio of the central value for that sub-unit to the sum of the central values for the constituent sub-units of that phonetic unit.